In the last post, I acquired the reviews for one Amazon product, analysed them, converted them into a corpus, and built a word cloud. In this post I plan to scrape more reviews across different products and preprocess the data.
Loading the libraries
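The library chunk itself is not reproduced here. Judging only from the functions called later in this post, it probably looked something like the sketch below. The package list is an assumption, and `quanteda.textplots` is only needed with quanteda 3.0 or later, where `textplot_network()` lives in that separate package.

```r
# Packages inferred from the functions used in this post (an assumption,
# not the original chunk).
library(tidyverse)           # tibble(), read_csv(), str_*() helpers, %>%
library(rvest)               # read_html(), html_nodes(), html_text()
library(quanteda)            # fcm_select(), topfeatures()
library(quanteda.textplots)  # textplot_network() in quanteda >= 3.0
```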
scrape_amazon <- function(ASIN, page_num) {
  # NOTE: the base URL is blank here; the review-page prefix goes in the first argument
  url_reviews <- paste0("", ASIN, "/?pageNumber=", page_num)
  doc <- read_html(url_reviews)  # Assign results to `doc`

  # Review title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a tibble
  tibble(review_title, review_text, review_star, page = page_num, ASIN)
}
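To cover several products, the scraper has to be called once per ASIN and page, and the results stacked. Here is a minimal sketch of that loop in base R. `scrape_many()`, `fake_scraper()`, the `pause` argument, and the placeholder ASINs are all hypothetical names introduced for illustration. The stand-in scraper only exists so the example runs offline.

```r
# Hypothetical helper: call a scraper for every (ASIN, page) combination
# and stack the results. `scraper` is any function(ASIN, page_num) that
# returns a data frame, such as scrape_amazon().
scrape_many <- function(asins, pages, scraper, pause = 0) {
  # page_num varies fastest, so all pages of one product come first
  grid <- expand.grid(page_num = pages, ASIN = asins, stringsAsFactors = FALSE)
  results <- lapply(seq_len(nrow(grid)), function(i) {
    Sys.sleep(pause)  # wait between requests to avoid hammering the server
    scraper(grid$ASIN[i], grid$page_num[i])
  })
  do.call(rbind, results)
}

# Stand-in scraper with the same columns as scrape_amazon(), so the
# sketch can run without a network connection.
fake_scraper <- function(ASIN, page_num) {
  data.frame(
    review_title = paste("title", page_num),
    review_text  = paste("text for", ASIN),
    review_star  = "5.0 out of 5 stars",
    page         = page_num,
    ASIN         = ASIN,
    stringsAsFactors = FALSE
  )
}

# "ASIN_ONE" and "ASIN_TWO" are placeholder IDs, not real products.
all_reviews <- scrape_many(c("ASIN_ONE", "ASIN_TWO"), 1:3, fake_scraper)
nrow(all_reviews)
```

For real use, pass `scrape_amazon` as `scraper` and set `pause` to a few seconds. The combined table can then be saved with `write_csv()` for later sessions.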
reviews <- read_csv("amazonreview.csv")
clean_text <- function(text) {
  text %>%
    # Remove URLs
    str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>%
    # Remove strings like "<U+0001F9F5>"; this must run before punctuation
    # is stripped, otherwise "<" and ">" are already gone and nothing matches
    str_remove_all("<.*?>") %>%
    # Replace "&" with "and"
    str_replace_all("&", "and") %>%
    # Remove punctuation, using a standard character class
    str_remove_all("[[:punct:]]") %>%
    # Remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline characters with a space
    str_replace_all("[\r\n]", " ") %>%
    # Make everything lowercase
    str_to_lower() %>%
    # Trim leading and trailing whitespace and collapse repeated spaces inside the string
    str_squish()
}
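The step that turns the cleaned reviews into the feature co-occurrence matrix `text_fcm` is not shown in this post. A minimal sketch of one way to build it with quanteda follows. The toy `cleaned` vector is made up, and both the stop-word removal and the document-level co-occurrence context are assumptions rather than the settings actually used.

```r
library(quanteda)

# Toy stand-in for cleaned review text (invented for illustration).
cleaned <- c(
  "great battery life and great screen",
  "battery drains fast but screen is sharp",
  "love the screen and the battery"
)

# Split each review into word tokens and drop English stop words.
toks <- tokens(cleaned)
toks <- tokens_remove(toks, stopwords("en"))

# Count how often two words appear together in the same review.
text_fcm <- fcm(toks, context = "document")

dim(text_fcm)
topfeatures(text_fcm, 3)
```

With the real data, `cleaned` would be the cleaned `review_text` column of `reviews`.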
# Pull the top 50 features
top_features <- names(topfeatures(text_fcm, 50))

# Retain only those top features as part of our matrix
even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")

# Check dimensions
dim(even_text_fcm)
[1] 50 50
# Compute size weight for vertices in network
size <- log(colSums(even_text_fcm))

# Create plot
textplot_network(even_text_fcm, vertex_size = size / max(size) * 2)
Next, I will add more reviews, continue the pre-processing, produce some analysis plots, and, if possible, do some sentiment analysis.
